For a first evaluation of the given datasets, we compute some basic metrics.

For more information on the metrics, and on how they are extracted for the smaller datasets, see:

`Evaluation metrics for picking an appropriate data set for our goals.ipynb`

For importing the four largest datasets into PostgreSQL and computing their metrics, see:

`Importing the large data sets to psql and computing their metrics.ipynb`

Finally, the computed metrics of all datasets are exported to the `metadata` directory and imported here for visualization.


In [1]:
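def percentage(some_float):
    # Format a fraction such as 0.56 as '56%'; int() truncates rather than rounds.
    return '%i%%' % int(100 * some_float)

def metrics_comparison_matrix(reviews_df):
    # The first five columns (labelled 1 to 5) hold fractions and become percentage
    # strings; the sixth (number_of_reviews) becomes an integer; the rest stay as-is.
    formatted = reviews_df.copy()
    for column in formatted.columns[:5]:
        formatted[column] = formatted[column].map(percentage)
    count_column = formatted.columns[5]
    formatted[count_column] = formatted[count_column].astype(int)
    return formatted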
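
As a quick illustration of the formatting step, the following minimal sketch applies the `percentage` helper to two made-up fractions (they are not taken from the datasets). Because `int()` truncates, a value that is stored slightly below a whole percent is cut down to the previous one. The `rounded_percentage` variant is a hypothetical alternative shown only for comparison; it is not part of this notebook.

```python
def percentage(some_float):
    # Same helper as in the notebook: truncates to a whole percent.
    return '%i%%' % int(100 * some_float)

def rounded_percentage(some_float):
    # Hypothetical alternative that rounds to the nearest whole percent.
    return '%i%%' % round(100 * some_float)

# Made-up example fractions, not values from the datasets.
print(percentage(0.566))          # 56%
print(percentage(0.29))           # 28%  (100 * 0.29 is 28.999999999999996)
print(rounded_percentage(0.29))   # 29%
```

Truncation is one reason the five percentage columns of a row can add up to slightly less than 100%.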

In [2]:
import pandas as pd

small_data_metrics = pd.read_csv('./metadata/initial-data-evaluation-metrics.csv')
large_data_metrics = pd.read_csv('./metadata/large-datasets-evaluation-metrics.csv')

In [3]:
metrics = metrics_comparison_matrix(
    pd.concat([ small_data_metrics, large_data_metrics ])
        .set_index('dataset_name'))

In [5]:
metrics.to_csv('./metadata/all-metrics-formatted.csv')
metrics


Out[5]:
| dataset_name | 1 | 2 | 3 | 4 | 5 | number_of_reviews | reviews_per_product | reviews_per_reviewer |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Amazon Instant Video | 4% | 5% | 11% | 22% | 56% | 37126 | 22.033234 | 7.237037 |
| Apps for Android | 10% | 5% | 11% | 20% | 51% | 752937 | 57.001817 | 8.627574 |
| Automotive | 2% | 2% | 6% | 19% | 68% | 20473 | 11.156948 | 6.992145 |
| Baby | 4% | 5% | 10% | 20% | 58% | 160792 | 22.807376 | 8.269067 |
| Beauty | 5% | 5% | 11% | 20% | 57% | 198502 | 16.403768 | 8.876358 |
| Cell Phones and Accessories | 6% | 5% | 11% | 20% | 55% | 194439 | 18.644069 | 6.974389 |
| Clothing Shoes and Jewelry | 4% | 5% | 10% | 20% | 58% | 278677 | 12.099032 | 7.075355 |
| Digital Music | 4% | 4% | 10% | 25% | 54% | 64706 | 18.135090 | 11.677676 |
| Grocery and Gourmet Food | 3% | 5% | 11% | 21% | 57% | 151254 | 17.359578 | 10.302704 |
| Health and Personal Care | 4% | 4% | 9% | 19% | 61% | 346355 | 18.687547 | 8.970836 |
| Home and Kitchen | 4% | 4% | 8% | 19% | 63% | 551682 | 19.537557 | 8.293600 |
| Kindle Store | 2% | 3% | 9% | 25% | 58% | 982619 | 15.865583 | 14.403046 |
| Office Products | 2% | 3% | 9% | 28% | 56% | 53258 | 22.007438 | 10.857900 |
| Patio Lawn and Garden | 3% | 5% | 12% | 25% | 53% | 13272 | 13.796258 | 7.871886 |
| Pet Supplies | 5% | 5% | 10% | 17% | 60% | 157836 | 18.547121 | 7.949033 |
| Sports and Outdoors | 3% | 3% | 8% | 21% | 63% | 296337 | 16.142997 | 8.324541 |
| Tools and Home Improvement | 3% | 3% | 8% | 21% | 63% | 134476 | 13.161985 | 8.082462 |
| Toys and Games | 2% | 3% | 9% | 22% | 61% | 167597 | 14.055434 | 8.633680 |
| Video Games | 6% | 5% | 12% | 23% | 51% | 231780 | 21.718516 | 9.537094 |
| Books | 3% | 4% | 10% | 24% | 55% | 8898040 | 24.180639 | 14.739956 |
| CDs and Vinyl | 4% | 4% | 9% | 22% | 59% | 1097592 | 17.031982 | 14.584390 |
| Electronics | 6% | 4% | 8% | 20% | 59% | 1689188 | 26.812082 | 8.779427 |
| Movies and TV | 6% | 6% | 11% | 22% | 53% | 1697533 | 33.915388 | 13.694200 |
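
The formatted table stores the shares in columns 1 to 5 as strings such as `56%`, so they need to be converted back to numbers before they can be sorted or plotted. The following is a minimal sketch of that conversion, using three rows copied from the output above; the helper name `parse_percentage` is an assumption made for illustration and does not appear in the notebooks.

```python
def parse_percentage(cell):
    # Turn a formatted cell such as '56%' back into the integer 56.
    return int(cell.rstrip('%'))

# Columns 1 to 5 of three rows, copied from the table above.
formatted_rows = {
    'Amazon Instant Video': ['4%', '5%', '11%', '22%', '56%'],
    'Automotive': ['2%', '2%', '6%', '19%', '68%'],
    'Books': ['3%', '4%', '10%', '24%', '55%'],
}

numeric_rows = {
    name: [parse_percentage(cell) for cell in cells]
    for name, cells in formatted_rows.items()
}
row_totals = {name: sum(values) for name, values in numeric_rows.items()}
print(row_totals)
# {'Amazon Instant Video': 98, 'Automotive': 97, 'Books': 96}
```

The totals fall a few points short of 100 because each share is cut to a whole percent when it is formatted, so the unformatted CSV files in `metadata` are the better source when exact values matter.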